Python3, Ventana support, import option #6

Open
wants to merge 6 commits into
base: master

Conversation

markemus

@markemus markemus commented Mar 5, 2020

  • Updated to Python 3 and tested on the supported formats.
  • Added support for the Ventana TIF format specified by OpenSlide here: https://openslide.org/formats/ventana/
  • Wrapped the anonymization in a function to allow it to be imported into other Python modules.

I'm not 100% certain that it works for the MRXS format: it runs without error on a test image, but I'm not sure how to check that the result is truly de-identified.

@Tomatenbiss

Tomatenbiss commented Sep 18, 2020

@markemus, I tested your script with a couple of slides from my institute. I am testing the results with QuPath and 3DHistech Case Viewer where you can access the label images. In case of a successful anonymization the label image cannot be found and the macro image is cropped so the label is not visible anymore.

When I use the script on .ndpi files the anonymization works, although I get the following warnings (full output):

<function do_ventana_tif at 0x000002965E7F8EE8>
TIFFReadDirectory: Warning, Unknown field with tag 65420 (0xff8c) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65421 (0xff8d) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65422 (0xff8e) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65423 (0xff8f) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65424 (0xff90) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65425 (0xff91) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65426 (0xff92) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65427 (0xff93) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65428 (0xff94) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65433 (0xff99) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65439 (0xff9f) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65440 (0xffa0) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65441 (0xffa1) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65442 (0xffa2) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65443 (0xffa3) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65444 (0xffa4) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65445 (0xffa5) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65446 (0xffa6) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65449 (0xffa9) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65455 (0xffaf) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65456 (0xffb0) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65457 (0xffb1) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65458 (0xffb2) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65420 (0xff8c) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65421 (0xff8d) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65422 (0xff8e) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65423 (0xff8f) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65424 (0xff90) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65425 (0xff91) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65426 (0xff92) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65427 (0xff93) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65428 (0xff94) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65433 (0xff99) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65439 (0xff9f) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65440 (0xffa0) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65441 (0xffa1) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65442 (0xffa2) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65443 (0xffa3) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65444 (0xffa4) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65445 (0xffa5) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65446 (0xffa6) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65449 (0xffa9) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65455 (0xffaf) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65456 (0xffb0) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65457 (0xffb1) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 65458 (0xffb2) encountered.
<function do_aperio_svs at 0x000002965E7F8D38>
<function do_hamamatsu_ndpi at 0x000002965E7F8DC8>

The script did not work with .mrxs files (old format). An error occurs and the slides still contain the label. This is the output (slide names anonymized):

<function do_ventana_tif at 0x0000021216498EE8>
<function do_aperio_svs at 0x0000021216498D38>
<function do_hamamatsu_ndpi at 0x0000021216498DC8>
<function do_3dhistech_mrxs at 0x0000021216498E58>
??????.mrxs: File contains no section headers.
file: Slidedat.ini', line: 1
'[GENERAL]\n'

However, in the next few days I will have access to .mrxs files in the newer format and will report the results.
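
For anyone hitting the same error: one plausible cause (an assumption, not confirmed in this thread) is a UTF-8 byte-order mark at the start of Slidedat.ini. Python's configparser does not skip a BOM, so it rejects the first line even though it looks like a valid section header. A minimal sketch using a synthetic file rather than a real slide; the SLIDE_VERSION key is only illustrative:

```python
import configparser
import os
import tempfile


def read_ini(path, encoding):
    """Parse an INI file using the given text encoding."""
    parser = configparser.ConfigParser()
    with open(path, encoding=encoding) as fh:
        parser.read_file(fh)
    return parser


# Synthetic stand-in for Slidedat.ini: a UTF-8 BOM followed by a section header.
tmp_dir = tempfile.mkdtemp()
ini_path = os.path.join(tmp_dir, "Slidedat.ini")
with open(ini_path, "wb") as fh:
    fh.write(b"\xef\xbb\xbf[GENERAL]\nSLIDE_VERSION = 1.9\n")

# Plain UTF-8 keeps the BOM as part of the first line, so parsing fails.
try:
    read_ini(ini_path, "utf-8")
    bom_rejected = False
except configparser.MissingSectionHeaderError:
    bom_rejected = True

# "utf-8-sig" strips a leading BOM if one is present.
parser = read_ini(ini_path, "utf-8-sig")
slide_version = parser["GENERAL"]["SLIDE_VERSION"]
```

If the real files turn out to have no BOM, the cause lies elsewhere and this sketch does not apply.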

@markemus
Author

BTW we discovered recently that the Aperio anonymization at least does NOT currently remove or modify macro images. Big problem. I've been working on a version that fixes that - it's working now and I'll push soon.

I'm not too familiar with the other file formats (other than Ventana) so I don't plan on checking them, but someone more familiar with them should go over them with a fine-toothed comb and make sure that everything that needs to be removed is actually being removed. I don't want to take them on and end up missing a barcode somewhere...

@Tomatenbiss

Tomatenbiss commented Sep 23, 2020

This is very interesting. I just ran the script (my pull request to you in which I did not touch the Aperio code) on my .svs files and it worked. In which format are your Aperio files stored? And which software do you use for testing?

@markemus
Author

markemus commented Sep 23, 2020

def do_aperio_svs(filename):
    with TiffFile(filename) as fh:
        # Check for SVS file
        try:
            desc0 = fh.directories[0].entries[IMAGE_DESCRIPTION].value()
            if not desc0.startswith(b'Aperio'):
                raise UnrecognizedFile
        except KeyError:
            raise UnrecognizedFile
        accept(filename, 'SVS')

        # Find and delete label
        for directory in fh.directories:
            lines = directory.entries[IMAGE_DESCRIPTION].value().splitlines()
            if len(lines) >= 2 and lines[1].startswith(b'label '):
                directory.delete(expected_prefix=LZW_CLEARCODE)
                break
        else:
            raise IOError("No label in SVS file")

This is the current code for Aperio anonymization. It doesn't touch the macro image at all.

@Tomatenbiss

I see where my "mistake" is. I did not check the macro image beforehand. On my slides the macro image does not include the label at all but contains the main part of the glass slide, including an annotation of where the tissue is (which in turn is used as the thumbnail image). So for my slides there is nothing to anonymize on the macro image. Do the macro images of your Aperio slides contain the label?

@markemus
Author

I believe they often do. I have a new version of this that removes the macro and also removes the filename from the ImageDescription tag, which can contain PHI. I'll try to push the new version today.

@markemus
Author

markemus commented Sep 23, 2020

@Tomatenbiss I merged your Mirax fix as well. Could you please run your tests again and make sure it still works for both formats? For Aperio images the macro will be removed and the "Filename = (filename)" values in ImageDescription tags should now read "Filename = X".
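
To illustrate the "Filename = X" rewrite described above, here is a minimal sketch that works on the description bytes only. It is not the code in this PR: the function name and regex are assumptions, and writing the changed value back into the TIFF is out of scope here.

```python
import re

# Matches "|Filename = <value>" up to the next "|" separator or line end.
# Case-insensitive as a precaution, in case key casing varies between files.
FILENAME_FIELD = re.compile(rb"(\|\s*Filename\s*=\s*)[^|\r\n]*", re.IGNORECASE)


def redact_filename(description: bytes) -> bytes:
    """Replace the Filename value in an Aperio-style description with 'X'."""
    return FILENAME_FIELD.sub(rb"\g<1>X", description)


# Synthetic description; the values are made up for illustration.
sample = (
    b"Aperio Image Library v11.2.1\r\n"
    b"46000x32914 (256x256) JPEG/RGB Q=30|AppMag = 20|Filename = 12345_DOE|MPP = 0.4990"
)
redacted = redact_filename(sample)
```

Because the replacement changes the length of the string, a real implementation also has to update the tag's count (and possibly its offset) in the IFD.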

@Tomatenbiss

@markemus, I tested the script with my files and achieved the following results:

3DHistech (MRXS)

  • 2.2: label is deleted, macro still exists (no label in the macro image found in my files)
  • 1.9: label is deleted, macro still exists (no label in the macro image found in my files)

Aperio (SVS)

  • label and macro are deleted (although macro did not contain label in my files)

Hamamatsu (NDPI)

  • macro is deleted (in contrast to MRXS and SVS, NDPI files only have a macro image, which includes the label)

My next step is to check the metadata of my MRXS files. Do you have any experience with this?

@markemus
Author

No, sorry, I've never dealt with mrxs before. But thank you for the test results, much appreciated!

@markemus
Author

markemus commented Oct 1, 2020

By the way, I looked into it a bit more: in Aperio, the macro image is cut off at the top so that it does not show the label. However, this cut happens at an arbitrary point, and some of the label (at least our labels) is still visible. It is hard to tell because the slide is backlit when the macro is captured, which makes the label appear very dark, but it is still very possible to reveal the data with some basic image manipulation (e.g. histogram equalization in the dark region).
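
As a way to check your own macro images for this, here is a minimal pure-Python sketch of histogram equalization on a flat list of grey values. The pixel data is synthetic; on a real macro you would first crop the dark region and convert it to greyscale with an imaging library of your choice.

```python
def equalize(pixels, levels=256):
    """Histogram-equalize a flat list of integer grey values in [0, levels)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of grey values.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        return list(pixels)  # only one grey level present: nothing to stretch

    scale = (levels - 1) / (total - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]


# Synthetic dark strip: background at grey level 4, faint "ink" at level 6.
dark_strip = [4, 4, 6, 4, 6, 6, 4, 4]
stretched = equalize(dark_strip)
```

In this toy case a difference of two grey levels, invisible to the eye, is stretched to the full 0 to 255 range, which is why a dark-looking label remnant cannot be assumed unreadable.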

Additionally, some of our slides had labels on the lower half of the image, which could be made readable with the same method.

@fiendish

fiendish commented Jul 9, 2021

@markemus @Tomatenbiss raising IOError on Aperio label not found and macro not found separately means that if there's no label but there is a macro, the macro won't be removed because the code will exit immediately after not finding the label. I think it would be better to just print the messages and move on rather than raise exceptions in those places.

And if you're worried about PHI, you probably should remove every field that isn't explicitly one of the image metrics, not just filename. Date is notably questionable, as remarked in issue 2. Note as well that the "h" in Originalheight appears sometimes lowercase, so probably casing of the tags is inconsistent and they should all be compared without respect to letter casing.
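
A rough sketch of what that could look like: keep only allow-listed fields and compare keys without regard to case. The allow-list below is a hypothetical example, not a vetted list of safe Aperio fields, and the function name is made up.

```python
# Hypothetical allow-list of image-metric keys (lowercase); adjust for your data.
ALLOWED_KEYS = {"appmag", "mpp", "originalwidth", "originalheight"}


def keep_image_metrics(description: str) -> str:
    """Drop every 'key = value' field whose key is not on the allow-list.

    The text before the first '|' (library version and dimensions) is kept as-is.
    """
    header, *fields = description.split("|")
    kept = []
    for field in fields:
        key = field.split("=", 1)[0].strip()
        # casefold() makes "Originalheight" and "OriginalHeight" compare equal.
        if key.casefold() in ALLOWED_KEYS:
            kept.append(field)
    return "|".join([header, *kept])


# Synthetic description; the values are made up for illustration.
sample = (
    "Aperio Image Library v11.2.1\r\n46000x32914 (256x256) JPEG/RGB Q=30"
    "|AppMag = 20|Date = 03/05/20|Filename = 12345|Originalheight = 32914|MPP = 0.4990"
)
cleaned = keep_image_metrics(sample)
```

An allow-list fails closed: a field nobody anticipated is dropped rather than leaked, which is the safer default when PHI is the concern.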

@fiendish

Also, according to openslide/openslide#297 (comment) it's not safe to rely on "label" and "macro" in the description field.

@markemus
Author

@fiendish I'm not maintaining this code anymore, however:

The exceptions exist to ensure that we can catch if PHI is not removed. If the label is misidentified as a thumbnail, for example, we want to know that the anonymization failed.

Removing additional fields is probably a good idea but wasn't necessary for our lab. Certainly fields like "project name" could be an issue if they exist.

From what I've seen "label" and "macro" are consistent within openslide formats but inconsistent across formats. The info for particular formats can be found here: https://openslide.org/formats/ . Since the code handles each format separately this shouldn't cause problems.

@fiendish

fiendish commented Jul 14, 2021

> I'm not maintaining this code anymore

That's ok then. Feel free to ignore. I figured I'd tag you here just in case.

I had just wanted to add information to this thread because it's likely a place where people will look if they go searching for SVS deidentification scripts.

> The exceptions exist to ensure that we can catch if phi is not removed.

The problem isn't using exceptions. The problem is where the exceptions are used.
Label and macro images are optional in the format, not required or guaranteed, so exiting early without checking both is bad if the goal is to fully redact the images rather than just give up. If for your purposes it was/is desired for the code to give up early and often, then ok, but that makes it worse at redacting. Rather than saying that an SVS just has no label frame, it will claim that an SVS without a label frame is not actually an SVS, which is wrong. For instance, if a previous version of this code removed the label but not the macro, the new version will never remove the macro because it will not see a label and will then exit early even though it should just continue and remove the macro. And then of course the description will also not be redacted because of some unrelated absent IFD that was optional in the first place.
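
To make the control flow concrete, here is a minimal sketch that looks for label and macro independently and keeps going when one is absent. The `kinds` list and the `delete` callback are hypothetical stand-ins for however the real code classifies and removes IFDs.

```python
def redact_associated_images(kinds, delete):
    """Delete every label and macro frame, reporting (not raising) on absence.

    kinds  -- one string per IFD, e.g. ["baseline", "thumbnail", "label", "macro"]
              (hypothetical; how a frame is classified is out of scope here)
    delete -- callable taking the index of a frame to remove; indices refer to
              the original ordering in `kinds`
    """
    removed = []
    for wanted in ("label", "macro"):
        matches = [i for i, kind in enumerate(kinds) if kind == wanted]
        if not matches:
            # Optional frame is absent: note it and carry on with the rest.
            print(f"No {wanted} image found; continuing")
            continue
        for index in matches:
            delete(index)
            removed.append((wanted, index))
    return removed


# A file whose label was already removed by an earlier run still loses its macro.
deleted = []
removed = redact_associated_images(["baseline", "thumbnail", "macro"], deleted.append)
```

A caller that wants a hard failure can still raise afterwards, for example when `removed` is empty and the file was expected to carry a label.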

> Since the code handles each format separately this shouldn't cause problems.

Already-partially-redacted images and images generated by the Aperio GT450 are the same format as the rest. They're all valid Aperio SVS files. The GT450 ones just don't use those strings in the descriptions of the label/macro images, and Leica (allegedly, anyway, because the information is secondhand) says that looking for those strings is the wrong way to check for those frames and that they should be identified by their SUBFILETYPE (254) TIFF tag being either 1 or 9 instead of 0 as mentioned in the openslide bug report.

I suppose one could make entirely separate handlers for GT450-generated SVS files and already-partially-redacted SVS files, but this function is called do_aperio_svs not do_aperio_svs_except_for_ones_from_gt450s_or_already_partially_redacted_ones. 🙂
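
For reference, a minimal sketch of reading the SUBFILETYPE (254) value of each IFD with only the standard library. It assumes a classic TIFF; large SVS files may be BigTIFF, which this does not handle. The 1-or-9 check follows the secondhand description above, not a published Leica specification.

```python
import struct

NEW_SUBFILE_TYPE = 254  # TIFF tag number for NewSubfileType


def subfile_types(data: bytes) -> list:
    """Return the NewSubfileType value of every IFD in a classic TIFF.

    IFDs without the tag are reported as 0, which is the TIFF default.
    """
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")

    result = []
    seen = set()
    while offset and offset not in seen:  # guard against looping IFD chains
        seen.add(offset)
        (count,) = struct.unpack_from(order + "H", data, offset)
        value = 0
        for i in range(count):
            entry = offset + 2 + 12 * i
            tag, typ, _n, raw = struct.unpack_from(order + "HHI4s", data, entry)
            if tag == NEW_SUBFILE_TYPE:
                # The value is stored inline: SHORT (type 3) or LONG (type 4).
                fmt = "H" if typ == 3 else "I"
                (value,) = struct.unpack_from(order + fmt, raw)
        result.append(value)
        (offset,) = struct.unpack_from(order + "I", data, offset + 2 + 12 * count)
    return result


def is_label_or_macro(subfile_type: int) -> bool:
    """Per the thread: label/macro frames reportedly carry 1 or 9 instead of 0."""
    return subfile_type in (1, 9)
```

Calling `subfile_types(open(path, "rb").read())` on a small file gives one value per IFD; for multi-gigabyte slides you would want to seek through the file instead of reading it whole.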

@websterlincoln

websterlincoln commented Aug 16, 2021

> @markemus, I tested the script with my files and achieved the following results:
>
> 3DHistech (MRXS)
>
>   • 2.2: label is deleted, macro still exists (no label in the macro image found in my files)
>   • 1.9: label is deleted, macro still exists (no label in the macro image found in my files)
>
> Aperio (SVS)
>
>   • label and macro are deleted (although macro did not contain label in my files)
>
> Hamamatsu (NDPI)
>
>   • macro is deleted (in contrast to mrxs and svs the ndpi files only have the macro image where the label is included)
>
> My next steps involve to check the metadata of my mrxs files. Do you have any experience with this?

@Tomatenbiss Metadata in MRXS files is stored in the Slidedat.ini file, which is essentially a text file that can be modified with only a few lines of code.
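
A minimal sketch of such an edit with Python's configparser. The key names in SENSITIVE_KEYS are assumptions for illustration; inspect your own Slidedat.ini to see which fields actually carry identifying data.

```python
import configparser
import io

# Hypothetical keys to overwrite; check your own files for what they really hold.
SENSITIVE_KEYS = ("SLIDE_NAME", "SLIDE_ID")


def redact_slidedat(text: str, keys=SENSITIVE_KEYS, placeholder="X") -> str:
    """Return Slidedat.ini text with the given [GENERAL] keys overwritten."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original key casing
    parser.read_string(text.lstrip("\ufeff"))  # tolerate a leading BOM
    for key in keys:
        if parser.has_option("GENERAL", key):
            parser.set("GENERAL", key, placeholder)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Synthetic input; the values are made up for illustration.
original = "[GENERAL]\nSLIDE_NAME = DOE_JOHN_123\nSLIDE_VERSION = 1.9\n"
redacted = redact_slidedat(original)
```

Note that writing the file back through configparser normalizes spacing and drops comments, so compare the result against the original and confirm that your viewer still opens the slide.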

@c-arthurs

Just out of curiosity, is there any reason not to merge this? It currently looks as if it has all of the functionality of the Python 2 version.

@bgilbert
Owner

@c-arthurs I haven't been actively maintaining this repo, and haven't taken the time to go through the PRs here. It might happen in the future, but isn't high on my priority list.

@Tomatenbiss

@c-arthurs, @bgilbert: Within the EMPAIA project, we have now developed our own solution for anonymizing WSIs (in various formats). It is currently available via GitLab. The paper is in review; the preprint can already be viewed on arXiv.

@markemus
Author

markemus commented Nov 14, 2022

So just an update, we have been using this in production for a few years now and it's quite stable. It covers all of the PHI that we've found in our own datasets and the troublesome fields that we've encountered, but we can't guarantee that it will cover every possible use case.

There is an update I need to push that adds barcode deletion for Hamamatsu. We also have a couple of extension scripts that save the PHI to separate files as a backup, as well as a basic web GUI, although those may be out of scope.

As far as the future goes, I am maintaining it again for now, and I have a coworker who can probably take it over if it comes up again. @bgilbert, would you prefer that I make a fork? There are a lot of improvements in this version: it adds support for a new format, patches some gaps in the old formats that were letting parts of the labels, barcodes, etc. through, and moves to Python 3.

@Tomatenbiss, very nice to see a new anonymization tool, and the new support for Philips! It sounds quite robust and I hope to play around with it soon when I get a chance. BTW our team has also experimented with a number of standardized formats (OME-TIFF and a custom Aperio-style format of our own) if you want to compare notes some time.
